Bayesian Data Cleaning for Web Data
Authors
Abstract
Data cleaning is a long-standing problem, which is growing in importance with the mass of uncurated web data. State-of-the-art approaches for handling inconsistent data are systems that learn and use conditional functional dependencies (CFDs) to rectify data. These methods learn data patterns (CFDs) from a clean sample of the data and use them to rectify the dirty/inconsistent data. While obtaining a clean training sample is feasible in enterprise data scenarios, it is infeasible in web databases, where there is no separate curated data. CFD-based methods are unfortunately particularly sensitive to noise; we will empirically demonstrate that the number of CFDs learned falls quite drastically with even a small amount of noise. To overcome this limitation, we propose a fully probabilistic framework for cleaning data. Our approach involves learning both the generative and error (corruption) models of the data and using them to clean the data. For generative models, we learn Bayesian networks from the data. For error models, we consider a maximum entropy framework for combining multiple error processes. The generative and error models are learned directly from the noisy data. We present the details of the framework and demonstrate its effectiveness in rectifying web data.
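The scoring idea behind such probabilistic cleaning can be sketched minimally: rank candidate repairs by prior(candidate) × likelihood(observed | candidate). This is a toy stand-in, not the paper's actual models: empirical value counts over the noisy column substitute for the learned Bayes network, and a simple edit-distance corruption model substitutes for the maximum-entropy error model; all names below (`clean_value`, `edit_distance`) are illustrative.

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def clean_value(observed, candidates, prior_counts, error_prob=0.1):
    """Pick the candidate maximizing prior(candidate) * P(observed | candidate)."""
    total = sum(prior_counts.values())
    best, best_score = observed, -1.0
    for c in candidates:
        d = edit_distance(observed, c)
        # toy error model: each edit is an independent corruption event
        likelihood = (error_prob ** d) * ((1 - error_prob) ** max(len(c) - d, 0))
        score = (prior_counts[c] / total) * likelihood
        if score > best_score:
            best, best_score = c, score
    return best

counts = Counter({"phoenix": 50, "phenix": 2, "tucson": 30})
print(clean_value("phenix", counts, counts))  # -> phoenix
```

Note how the frequent value "phoenix" wins over the literal match "phenix": the prior outweighs the small corruption penalty, which is exactly why such models can be learned from noisy data directly.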
Similar resources
Data Cleaning of Medical Data for Knowledge Mining
Data mining or data analysis in biomedicine differs from other research fields, because biomedical data are heterogeneous and come from different sources. Data from different medical sources are voluminous, each source may have a different data structure or data schema, and the data quality also differs. Moreover, each physician may have his or her own interpretation with...
An Efficient Algorithm for Data Cleaning of Web Logs with Spider Navigation Removal
The World Wide Web is growing rapidly with the exponential growth of websites, providing users with vast amounts of information. Text files called web logs store a user's clicks whenever the user visits a website. Web usage mining is a stream of web mining that involves applying mining techniques to server logs containing user clickstreams....
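Spider (crawler) navigation removal of the kind this related paper addresses is commonly done with simple heuristics, such as filtering requests for robots.txt and matching known bot user-agent strings. A minimal sketch under an assumed log schema (the dict keys and `is_spider` name are illustrative, not from the paper):

```python
import re

# common heuristic patterns for crawler user-agents (illustrative list)
BOT_AGENTS = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)

def is_spider(entry):
    """entry: dict with 'path' and 'agent' keys (assumed log schema)."""
    return entry["path"] == "/robots.txt" or bool(BOT_AGENTS.search(entry["agent"]))

log = [
    {"ip": "1.2.3.4", "path": "/index.html", "agent": "Mozilla/5.0"},
    {"ip": "5.6.7.8", "path": "/robots.txt", "agent": "Googlebot/2.1"},
]
human_clicks = [e for e in log if not is_spider(e)]
print(len(human_clicks))  # -> 1
```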
Scalable Probabilistic Framework for Improving Data
Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like CFDs (which hav...
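The constraint-violation detection underlying such minimum-cost repair can be sketched for the simplest case, a constant CFD: a functional dependency plus a pattern of required constants. The function below is a toy illustration of detection only (no repair); the names and the restriction to constant patterns are assumptions, not the cited paper's method.

```python
def cfd_violations(tuples, lhs, rhs, pattern):
    """Find tuples violating a constant CFD lhs -> rhs.

    pattern maps attributes to required constants; "_" means wildcard.
    Example CFD: [zip] -> [city] with pattern (zip=85281 => city=Tempe).
    """
    bad = []
    for t in tuples:
        # the tuple falls under the CFD if its LHS matches the pattern
        if all(pattern.get(a, "_") in ("_", t[a]) for a in lhs):
            want = pattern.get(rhs)
            if want not in (None, "_") and t[rhs] != want:
                bad.append(t)
    return bad

rows = [{"zip": "85281", "city": "Tempe"}, {"zip": "85281", "city": "Tucson"}]
print(cfd_violations(rows, ["zip"], "city", {"zip": "85281", "city": "Tempe"}))
```

A minimum-cost repair would then change the cheapest set of attribute values so that no violations remain; here the second tuple's city would be rewritten.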
Quality Assurance of Government Databases
Data cleaning is a vital process that ensures the quality of data stored in real-world databases. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. Digital government serves as an emerging area for database research, such as database management, data integration, da...
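Record linkage, as described in this related abstract, is often bootstrapped with a pairwise string-similarity measure and a threshold. A minimal sketch using token-set Jaccard similarity (the measure, threshold, and function names are illustrative choices, not the paper's):

```python
def jaccard(a, b):
    """Jaccard similarity of the lowercase token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def find_duplicates(records, threshold=0.6):
    """Return index pairs whose similarity meets the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs

names = ["John A Smith", "john smith", "Mary Jones"]
print(find_duplicates(names))  # -> [(0, 1)]
```

Production record-linkage systems avoid the quadratic pair comparison with blocking, but the scoring idea is the same.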
Probabilistic Models for Anomaly Detection in Remote Sensor Data Streams
Remote sensors are becoming the standard for observing and recording ecological data in the field. Such sensors can record data at fine temporal resolutions, and they can operate under extreme conditions prohibitive to human access. Unfortunately, sensor data streams exhibit many kinds of errors ranging from corrupt communications to partial or total sensor failures. This means that the raw dat...
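A common probabilistic baseline for such sensor-stream anomaly detection is to flag readings far from a running Gaussian estimate. The sketch below uses Welford's online mean/variance update; the threshold `k`, the `warmup` period, and the function name are illustrative assumptions, not the cited paper's model.

```python
import math

def flag_anomalies(stream, k=3.0, warmup=10):
    """Flag readings more than k running std devs from the running mean."""
    n, mean, m2 = 0, 0.0, 0.0
    flags = []
    for x in stream:
        if n >= warmup:
            std = math.sqrt(m2 / (n - 1))
            flags.append(abs(x - mean) > k * std)
        else:
            flags.append(False)  # not enough data to judge yet
        # Welford's online update of mean and sum of squared deviations
        n += 1
        d = x - mean
        mean += d / n
        m2 += d * (x - mean)
    return flags

readings = [10.0] * 15 + [100.0] + [10.0] * 4
print(flag_anomalies(readings).index(True))  # -> 15
```

Flagging before updating keeps an outlier from masking itself; richer models (e.g., per-sensor dynamics) follow the same detect-then-update loop.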
Journal: CoRR
Volume: abs/1204.3677
Pages: -
Publication year: 2012